generating proposals for new samples from a conditional proposal distribution \(q(y | x)\);
accepting or rejecting those proposals.
The Metropolis-Hastings Algorithm
Given \(X_t = x_t\):
Generate \(Y_t \sim q(y | x_t)\);
Set \(X_{t+1} = Y_t\) with probability \(\rho(x_t, Y_t)\), where \[\rho(x, y) = \min \left\{\frac{\pi(y)}{\pi(x)}\frac{q(x | y)}{q(y | x)}, 1\right\},\] else set \(X_{t+1} = x_t\).
How Simple Is That?
The devil is in the details: performance and efficiency are highly dependent on the choice of \(q\).
Key: There is a tradeoff between exploration and acceptance.
Wide proposal: Can make bigger jumps, may be more likely to reject proposals.
Narrow proposal: More likely to accept proposals, may not “mix” efficiently.
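To make this tradeoff concrete, here is a minimal sketch (hypothetical names, standard normal target, not from the lecture code): a random-walk Metropolis sampler run with a narrow and a wide proposal, comparing acceptance rates.

```julia
using Random, Statistics

# Random-walk Metropolis sketch; returns the chain and the acceptance rate.
function rw_metropolis(logtarget, x0, σ, n)
    x = x0
    chain = Vector{Float64}(undef, n)
    accepted = 0
    for i in 1:n
        y = x + σ * randn()                          # symmetric proposal
        if log(rand()) < logtarget(y) - logtarget(x) # log-scale accept test
            x = y
            accepted += 1
        end
        chain[i] = x
    end
    return chain, accepted / n
end

logtarget(x) = -x^2 / 2   # standard normal target, up to a constant

Random.seed!(1)
_, acc_narrow = rw_metropolis(logtarget, 0.0, 0.1, 10_000)
_, acc_wide   = rw_metropolis(logtarget, 0.0, 10.0, 10_000)
println("σ = 0.1 acceptance rate: ", round(acc_narrow; digits=2))
println("σ = 10  acceptance rate: ", round(acc_wide; digits=2))
```

The narrow proposal accepts far more often but moves in tiny steps, so the chain mixes slowly; the wide proposal jumps farther but rejects most moves.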
Proposal Distribution Choice
The original Metropolis et al. (1953) algorithm used symmetric distributions (\(q(y | x) = q(x | y)\)).
Then the acceptance probability reduces to \[\rho = \min \left\{\frac{\pi(y)}{\pi(x)}, 1\right\}.\]
A common choice: \(y \sim \text{Normal}(X_t, \sigma)\) centered around the current point \(X_t\).
Julia Implementation
```julia
using Distributions # for Normal

# target_density is assumed to be defined elsewhere (unnormalized posterior)
function mh_transition(x_current, σ)
    # generate new proposal
    x_proposal = rand(Normal(x_current, σ))
    u = rand()
    # transition log-probability
    ρ = log(target_density(x_proposal)) - log(target_density(x_current))
    if log(u) < min(ρ, 0) # compare on the log scale: log(1) = 0
        y = x_proposal
    else
        y = x_current
    end
    return y, log(target_density(y))
end
```
\[
\begin{gather}
y = a + bx + \varepsilon \\
\varepsilon \sim \text{Normal}(0, \sigma).
\end{gather}
\]
This makes the likelihood: \[
y \sim \text{Normal}(a+bx, \sigma).
\]
Prior Selection
\[
\begin{gather}
a \sim \text{Normal}(0, 2) \\
b \sim \text{Normal}(0, 2) \\
\sigma \sim \text{Half-Normal}(0, 1)
\end{gather}
\]
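For intuition, the likelihood and priors above can be combined into the unnormalized log-posterior that an M-H sampler evaluates. This is only a sketch: `logpost` and `lognormal_pdf` are illustrative names, not from the lecture code, and additive constants (including the Half-Normal's) are dropped.

```julia
# Log-density of Normal(μ, σ) at z, up to additive constants.
lognormal_pdf(z, μ, σ) = -0.5 * ((z - μ) / σ)^2 - log(σ)

# Unnormalized log-posterior for y = a + bx + ε with the priors above.
function logpost(θ, x, y)
    a, b, σ = θ
    σ > 0 || return -Inf                     # Half-Normal support constraint
    lp = lognormal_pdf(a, 0.0, 2.0)          # a ~ Normal(0, 2)
    lp += lognormal_pdf(b, 0.0, 2.0)         # b ~ Normal(0, 2)
    lp += lognormal_pdf(σ, 0.0, 1.0)         # σ ~ Half-Normal(0, 1)
    for i in eachindex(x)                    # likelihood: yᵢ ~ Normal(a + b xᵢ, σ)
        lp += lognormal_pdf(y[i], a + b * x[i], σ)
    end
    return lp
end
```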
Code
```julia
using Distributions, Plots, StatsPlots, LaTeXStrings, Measures

# generate data for samples
function gen_data(a, b, σ)
    x = collect(0:20)
    y = a .+ b * x
    # sample and add noise
    ε = rand(Normal(0, σ), length(x))
    y .+= ε
    return y
end

# sample and plot
n_samples = 1000
a = rand(Normal(0, 2), n_samples)
b = rand(Normal(0, 2), n_samples)
σ = rand(truncated(Normal(0, 1), lower=0), n_samples)
y_prior = [gen_data(a[i], b[i], σ[i]) for i in 1:n_samples]
# convert y to a Matrix by vcatting each vector
y_prior = mapreduce(permutedims, vcat, y_prior)

plt_prior_1 = plot(; ylabel=L"$y$", xlabel=L"$x$", tickfontsize=16,
    legendfontsize=16, guidefontsize=18, bottom_margin=5mm,
    left_margin=5mm, legend=false)
for x ∈ [0, 5, 10, 15, 20]
    boxplot!(plt_prior_1, [x], y_prior[:, x+1], color=:blue)
end
plot!(plt_prior_1, size=(600, 550))
plt_prior_1
```
Figure 2: Prior predictive plot for regression example.
Proposal Distribution and Initial Value
To illustrate how the M-H algorithm works, let’s use a proposal \[\mathcal{N}(x_t, 0.01I_3).\]
And let’s start at \[x_0 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}.\]
Run multiple chains from “overdispersed” starting points
Compare intra-chain and inter-chain variances
Summarized as \(\hat{R}\) statistic: closer to 1 implies better convergence.
Can also compare distributions across multiple chains, analogous to the half-chain check.
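A minimal sketch of the \(\hat{R}\) computation, with chains stored as columns of a matrix. This is the basic non-split version; production implementations (e.g. MCMCDiagnosticTools.jl) also split each chain in half and rank-normalize.

```julia
using Random, Statistics

# Basic Gelman-Rubin R̂ from m chains of length n, stored as columns.
function rhat(chains::AbstractMatrix)
    n, m = size(chains)
    chain_means = vec(mean(chains; dims=1))
    B = n * var(chain_means)                  # between-chain variance
    W = mean(vec(var(chains; dims=1)))        # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return sqrt(var_plus / W)
end

Random.seed!(1)
mixed = randn(5_000, 4)                 # four chains sampling the same target
stuck = copy(mixed)
stuck[:, 1] .+= 3.0                     # one chain exploring a shifted region
println("well-mixed R̂: ", round(rhat(mixed); digits=3))
println("stuck-chain R̂: ", round(rhat(stuck); digits=3))
```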
On Multiple Chains
Unless a specific scheme is used, multiple chains are not a solution for convergence issues: each individual chain still needs to converge, and each chain’s burn-in must be discarded.
This means multiple chains are mainly useful for diagnostics; once they have all converged and been run long enough, their samples can be pooled freely.
Heuristics for Convergence
If you’re more interested in the mean estimate, you can also look at its stability by iteration or at the Monte Carlo standard error.
Look at traceplots; do you see sudden “jumps”?
When in doubt, run the chain longer.
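For the mean-stability check, the Monte Carlo standard error can be estimated by batch means; this sketch uses a ⌊√n⌋ batch size, which is one common rule of thumb among several.

```julia
using Random, Statistics

# Batch-means estimate of the Monte Carlo standard error of the mean:
# split the chain into batches, then use the variance of the batch means.
function mcse_batch(chain::AbstractVector)
    n = length(chain)
    b = max(1, floor(Int, sqrt(n)))     # batch size
    nb = div(n, b)                      # number of full batches
    bm = [mean(chain[(i - 1) * b + 1 : i * b]) for i in 1:nb]
    return sqrt(var(bm) / nb)
end

Random.seed!(1)
chain = randn(10_000)   # iid draws: the true standard error is 1/√10000 = 0.01
se = mcse_batch(chain)
println("estimated MCSE: ", round(se; digits=4))
```

For a correlated chain, the batch means absorb the autocorrelation (for batches long enough), which is why this beats the naive `std(chain)/√n`.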
Increasing Efficiency
Adaptive Metropolis-Hastings
Adjust proposal density to hit target acceptance rate.
Need to be cautious about detailed balance.
Typical strategy is to adapt for a portion of the initial chain (part of the burn-in), then run longer with that proposal.
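A sketch of that strategy for a 1-D symmetric random-walk proposal; `tune_scale`, the block sizes, and the 0.4 target rate are all illustrative choices, not from the lecture.

```julia
using Random

# Adaptive phase: adjust the proposal scale σ in blocks during burn-in,
# widening it when acceptance is above target and narrowing it otherwise,
# then freeze σ for the main sampling run.
function tune_scale(logtarget, x0, σ0; blocks=80, blocklen=100, target=0.4)
    x, σ = x0, σ0
    for _ in 1:blocks
        accepted = 0
        for _ in 1:blocklen
            y = x + σ * randn()
            if log(rand()) < logtarget(y) - logtarget(x)
                x, accepted = y, accepted + 1
            end
        end
        rate = accepted / blocklen
        σ *= (rate > target ? 1.1 : 1 / 1.1)
    end
    return σ, x     # frozen scale and current state for the main run
end

Random.seed!(2)
σ_tuned, _ = tune_scale(z -> -z^2 / 2, 0.0, 0.01)   # standard normal target
println("tuned proposal scale: ", round(σ_tuned; digits=2))
```

Because the proposal stops changing after burn-in, the main-run chain still satisfies detailed balance with respect to the target.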
Hamiltonian Monte Carlo
Idea: Use proposals which steer towards “typical set” without collapsing towards the mode (based on Hamiltonian vector field);
Requires gradient information: can be obtained through autodifferentiation; challenging for external models;
Can be very efficient due to potential for anti-correlated samples, but very sensitive to parameterization.
Will see how to use this next class.
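As a preview, the core ingredient is a leapfrog integrator that simulates Hamiltonian dynamics to generate the proposal; this sketch uses an analytic gradient for a standard normal target (in practice the gradient comes from autodifferentiation).

```julia
# Leapfrog integration of Hamiltonian dynamics: L steps of size ε, with
# gradlogp the gradient of the log-density (the negative potential gradient).
function leapfrog(x, p, ε, L, gradlogp)
    p += ε / 2 * gradlogp(x)       # initial half-step for momentum
    for _ in 1:(L - 1)
        x += ε * p                 # full position step
        p += ε * gradlogp(x)       # full momentum step
    end
    x += ε * p
    p += ε / 2 * gradlogp(x)       # final half-step for momentum
    return x, p
end

# Standard normal target: log π(x) = -x²/2, so ∇ log π(x) = -x.
x1, p1 = leapfrog(1.0, 0.5, 0.01, 100, z -> -z)
# The Hamiltonian (x² + p²)/2 is nearly conserved along the trajectory,
# which is what keeps HMC acceptance rates high even for distant proposals.
```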
Key Points, Upcoming Schedule, and References
Key Points (Metropolis-Hastings)
Construct ergodic and reversible Markov chains with posterior as stationary distribution.
Metropolis-Hastings: conceptually simple algorithm, but implementation plays a major role.
Proposal distribution plays a large role in acceptance rate and effective sample size.
Key Points (Convergence)
Must rely on “accumulation of evidence” from heuristics for determination about convergence to stationary distribution.
Transient portion of chain: Meh. Some people worry about this too much. Discard or run the chain longer.
Parallelizing solves few problems, but running multiple chains can be useful for diagnostics.
Next Classes
Wednesday: MCMC Examples
Monday: MCMC Lab (No exercises these weeks)
Next Wednesday: Literature Presentations (email slides by 9pm Tuesday night).
Assessments
Homework 3: Due 3/22
References
Gelman, A., & Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Stat. Sci., 7, 457–472. https://doi.org/10.1214/ss/1177011136
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys., 21, 1087–1092. https://doi.org/10.1063/1.1699114